Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data

Authors

Dmitry Pavlov

Heikki Mannila

Padhraic Smyth

Abstract

We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy method, and the itemset inclusion-exclusion model. In the maximum entropy method we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model itemsets and their frequencies are stored in a data structure called an ADtree that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high-dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental tradeoffs between approximation error, model complexity, and the online time required to compute a query answer.
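As a rough illustration of two ideas named in the abstract, the sketch below compares the baseline independence model with an inclusion-exclusion computation for a query that contains a negated attribute. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy `transactions` data and all function names are hypothetical, and itemset frequencies are counted directly from the data rather than looked up in an ADtree of frequent itemsets.

```python
from itertools import combinations

# Hypothetical toy data: each row is the set of attributes equal to 1.
transactions = [
    {"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"},
    {"a", "b"}, {"c"}, set(), {"a", "b", "c"},
]

def itemset_frequency(items, data):
    """Fraction of rows in which every attribute in `items` equals 1."""
    items = set(items)
    return sum(items <= row for row in data) / len(data)

def independence_estimate(positive, negative, data):
    """Baseline: multiply per-attribute marginals as if attributes were independent."""
    p = 1.0
    for item in positive:
        p *= itemset_frequency([item], data)
    for item in negative:
        p *= 1.0 - itemset_frequency([item], data)
    return p

def inclusion_exclusion(positive, negative, freq):
    """P(all `positive` are 1 and all `negative` are 0) from itemset frequencies.

    `freq` maps a tuple of attributes to the frequency of that itemset.
    The sum runs over all subsets of the negated attributes, so it needs
    2 ** len(negative) itemset frequencies (assumed to be available here).
    """
    total = 0.0
    for k in range(len(negative) + 1):
        for subset in combinations(negative, k):
            total += (-1) ** k * freq(tuple(positive) + subset)
    return total

def exact_answer(positive, negative, data):
    """Direct scan of the data, for comparison."""
    pos, neg = set(positive), set(negative)
    return sum(pos <= row and not (neg & row) for row in data) / len(data)

# Query: a = 1 AND b = 1 AND c = 0
freq = lambda items: itemset_frequency(items, transactions)
print(exact_answer(("a", "b"), ("c",), transactions))           # 0.25
print(inclusion_exclusion(("a", "b"), ("c",), freq))            # 0.25
print(independence_estimate(("a", "b"), ("c",), transactions))  # 0.1953125
```

On this toy data the inclusion-exclusion answer matches the direct scan because every itemset frequency it needs is known exactly, while the independence estimate is off because attributes a and b co-occur more often than independence would predict.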

Similar resources

Probabilistic Models for Query Approximation with Large Sparse Binary Data

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important task for ...

Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practic...

Learning Generalization Query Models for Transaction Data

Interactive querying of massive data sets is an increasingly important application. Existing techniques in the database literature have focused on producing fast approximations to exact data counts, using (for example) independence models. In this paper we examine generalization queries, e.g., the problem of answering queries in real-time that generalize to new data rather than just providing c...

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
